From Central Bank of the Republic of Turkey’s EVDS Dataset, the target variable “Consumer Price Index (2003=100)(TURKSTAT) -> “PURCHASE OF VEHICLES” " is selected. The aim of this study to find other relevant data to predict the April 2021 value of our target variable with time series regression. To have more accurate prediction, I initially hypothesize that there would be strong relationship between the target variable and ‘’Weighted Average Interest Rates For Banks Loans For Vehicles’‘,’‘USDTRY Exchange Rate(Buying)’’, “The probability of buying a car (over the next 12 months)-Level” and “Seasonally unadjusted Consumer Confidence Index”.
Importing necessary libraries
library("readxl")
library("ggplot2")
library("tidyverse")
library("zoo")
library(plotly)
library("data.table")
library(forecast)
library(ggcorrplot)
Target variable is gathered as an excel file from the EVDS:
df <- read_excel("EVDS.xlsx")
str(df)
## tibble[,6] [112 × 6] (S3: tbl_df/tbl/data.frame)
## $ Tarih : chr [1:112] "2012-01" "2012-02" "2012-03" "2012-04" ...
## $ TP DK USD A YTL: num [1:112] 1.84 1.75 1.78 1.78 1.8 ...
## $ TP KTF11 : num [1:112] 15 14.8 13.4 13.2 13 ...
## $ TP FG J071 : num [1:112] 133 132 132 132 132 ...
## $ TP TG2 Y01 : num [1:112] 92.5 93.8 93.2 88.7 91.9 ...
## $ TP TG2 Y17 : num [1:112] 14.1 11.9 11.7 12.7 12.5 ...
drop_na(df)
#converting to data table
setDT(df)
There are no missing values.
In order to have better visualization, dates are converted into Date class of R.
df$Date <- as.yearmon(df$Tarih)
#USDTRY
p <- ggplot(data = df, aes(x = Date, y = `TP FG J071`,group=1))+
geom_line(color = '#007cc3') + labs(title = 'CPI - Purchase of Vehicles', x= "Date", y = "CPI - Purchase of Vehicles") + theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
fig <- ggplotly(p)
fig <- fig %>% layout(dragmode = "pan")
fig
Between 2012-2021, CPI for vehicles has a positive trend with more variance is observable after 2018. There isn’t any obvious seasonality exist, with only visual inspection. The data looks like exponential and more variance in the recent years are observed, therefore log transformation would be a good idea.
Fitting the simplest model with univariate linear regression with/without log transformation:
#trend:
df[,trend:=1:.N]
df$target_log <- as.numeric(log(df$`TP FG J071`))
fit <- lm(`TP FG J071`~trend, data = df)
summary(fit)
##
## Call:
## lm(formula = `TP FG J071` ~ trend, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.385 -25.640 -9.136 15.174 110.806
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.1450 6.7665 10.37 <2e-16 ***
## trend 3.0552 0.1039 29.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.57 on 110 degrees of freedom
## Multiple R-squared: 0.887, Adjusted R-squared: 0.886
## F-statistic: 863.9 on 1 and 110 DF, p-value: < 2.2e-16
checkresiduals(fit)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 102.52, df = 10, p-value < 2.2e-16
#trend:
#log transformation
fit2 <- lm(target_log~trend, data = df)
summary(fit2)
##
## Call:
## lm(formula = target_log ~ trend, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.11288 -0.05358 -0.01705 0.03442 0.21788
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7054524 0.0143910 327.0 <2e-16 ***
## trend 0.0124016 0.0002211 56.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07564 on 110 degrees of freedom
## Multiple R-squared: 0.9662, Adjusted R-squared: 0.9659
## F-statistic: 3147 on 1 and 110 DF, p-value: < 2.2e-16
checkresiduals(fit2)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 94.44, df = 10, p-value = 7.021e-16
Comparing these two models, model with log transformed target variable yields better results, in terms of R-squared values and residual distribution. Log transformation causes lower variance in the model, thus better R-squared and adjusted R-squared. Both models reject null hypothesis with low p-values. Yet in both models, residuals don’t have zero mean, and have serial correlation, therefore there are still improvement areas to be realized. Continuing with log transformation, the trend line can be visualized:
## List of 2
## $ text :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : chr "#004785"
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.text:List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : chr "#004785"
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
In order to control seasonality and thus to obtain better regression, a ‘month’ variable can be introduced to the model.
#adding month variable:
month=seq(1,12,by=1)
df = cbind(df,month)
## Warning in as.data.table.list(x, keep.rownames = keep.rownames, check.names
## = check.names, : Item 2 has 12 rows but longest item has 112; recycled with
## remainder.
#linear regression
fit3 <- lm(target_log~trend+month, data = df)
summary(fit3)
##
## Call:
## lm(formula = target_log ~ trend + month, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.10853 -0.05576 -0.01624 0.03871 0.22238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7162889 0.0191852 245.829 <2e-16 ***
## trend 0.0124080 0.0002215 56.025 <2e-16 ***
## month -0.0017619 0.0020596 -0.855 0.394
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07573 on 109 degrees of freedom
## Multiple R-squared: 0.9665, Adjusted R-squared: 0.9658
## F-statistic: 1570 on 2 and 109 DF, p-value: < 2.2e-16
checkresiduals(fit3)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 94.519, df = 10, p-value = 6.773e-16
#
## List of 2
## $ text :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : chr "#004785"
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.text:List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : chr "#004785"
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
Since the p-value is greater than 0.05, null-hypothesis is accepted for ‘month’ variable. I can say that, with only considering vehicle sales data, ‘month’ variable does not make the model better. Let’s focus on getting rid of autocorrelation by introducing lagged variables to the model. Starting with adding 2 lagged variables, for one month and two months for the residuals:
df$target_lagged_1 <-shift(residuals(fit2), -1)
df$target_lagged_2 <-shift(residuals(fit2), -2)
#linear regression
fit4 <- lm(target_log~trend+target_lagged_1+target_lagged_2, data = df)
summary(fit4)
##
## Call:
## lm(formula = target_log ~ trend + target_lagged_1 + target_lagged_2,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.106578 -0.010895 0.003173 0.011338 0.151340
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.712e+00 4.992e-03 943.877 < 2e-16 ***
## trend 1.229e-02 7.838e-05 156.804 < 2e-16 ***
## target_lagged_1 1.209e+00 9.308e-02 12.984 < 2e-16 ***
## target_lagged_2 -2.919e-01 9.306e-02 -3.137 0.00221 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02561 on 106 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.996, Adjusted R-squared: 0.9958
## F-statistic: 8708 on 3 and 106 DF, p-value: < 2.2e-16
checkresiduals(fit4)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 8.2653, df = 10, p-value = 0.6029
#
Adding two lagged residuals gives a significantly better model, with residuals are more normally distributed and its mean is close to 0 and serial correlation is in the boundaries. To visualize prediction:
## List of 2
## $ text :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : chr "#004785"
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.text:List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : chr "#004785"
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
It can be seen that the predictor is nearly having the same curve observed in the real data. The overfitting can be a problem here.
In order to analyze overfitting and the relationship with the target variable and other variables mentioned above, new models and EDA should be made.
corr <-cor(df[,2:6])
ggcorrplot(corr,lab = TRUE)
With our target variable (‘TP FG J071’), USDTRY exchange rate(‘TP DK USD A YTL’) and Consumer Confidence Index(TP TG2 Y01) are strongly correlated. To put all these parameters in our model, also including trend parameter founded earlier we will find the optimum variables. At first, no lagged variables are introduced.
#df$target_lagged_1 <-shift(residuals(fit2), -1)
#df$target_lagged_2 <-shift(residuals(fit2), -2)
#linear regression
dolar = df$`TP DK USD A YTL`
survey = df$`TP TG2 Y01`
interest = df$`TP KTF11`
probability = df$`TP TG2 Y17`
fit5 <- lm(target_log~trend+dolar+survey+interest+probability, data = df)
summary(fit5)
##
## Call:
## lm(formula = target_log ~ trend + dolar + survey + interest +
## probability, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.123531 -0.025894 -0.000318 0.026989 0.101137
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0846973 0.1256142 32.518 < 2e-16 ***
## trend 0.0065113 0.0004004 16.263 < 2e-16 ***
## dolar 0.1092799 0.0082012 13.325 < 2e-16 ***
## survey 0.0045491 0.0014295 3.182 0.00192 **
## interest 0.0049892 0.0009488 5.258 7.6e-07 ***
## probability 0.0049026 0.0030634 1.600 0.11249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04093 on 106 degrees of freedom
## Multiple R-squared: 0.9905, Adjusted R-squared: 0.99
## F-statistic: 2203 on 5 and 106 DF, p-value: < 2.2e-16
checkresiduals(fit5)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 67.971, df = 10, p-value = 1.091e-10
#
This model is also very succesfull in terms of residual mean and R-squared values (achieving 0.99 R-squared and adjusted R-squared); however,the probability of buying a car (over the next 12 months) fails to reject null hypothesis, autocorrelation is still a issue and residuals are not properly gaussian distributed. I discard ‘probability’ variable and adding one lagged variables(two lagged variables failed to reject null hypothesis):
df$target_lagged_1 <-shift(residuals(fit5), -1)
fit6 <- lm(target_log~trend+dolar+survey+interest+target_lagged_1, data = df)
summary(fit6)
##
## Call:
## lm(formula = target_log ~ trend + dolar + survey + interest +
## target_lagged_1, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.108238 -0.012638 0.002346 0.015148 0.096767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1499302 0.0830129 49.991 < 2e-16 ***
## trend 0.0066920 0.0002735 24.471 < 2e-16 ***
## dolar 0.1054082 0.0054526 19.332 < 2e-16 ***
## survey 0.0045629 0.0008303 5.496 2.74e-07 ***
## interest 0.0047652 0.0006496 7.336 4.83e-11 ***
## target_lagged_1 0.7566791 0.0673338 11.238 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02801 on 105 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9954, Adjusted R-squared: 0.9952
## F-statistic: 4542 on 5 and 105 DF, p-value: < 2.2e-16
checkresiduals(fit6)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 12.246, df = 10, p-value = 0.2689
## List of 2
## $ text :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : chr "#004785"
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.text:List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : chr "#004785"
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
This model gives the best results achieved in this study. Comparing with one variable linear regression, this model yield the similar results in terms of R-squared and ACF, but residuals are more evenly distributed and deviates less than the mean. Also, visually it predicts the target variable very precisely and more resilient to overfitting with considering more than one variables and having less lagged values. Also, adjusted R-squared is not decreased with introducing more variables, it means that new introduced variables are important for predicting the target variable.
With this study, one can argue that CPI for Vehicles, USDTRY Exchange Rate and Interest Rates and Consumer Confidence Index are related to each other and they can be used for predicting CPI for Vehicles. Without using them, the regression can suffer from overfitting and lacks of normality and zero mean assumptions. Furthermore, CPI for Vehicles has not significant seasonality and has upwards trend. To obtain a more accurate model, external data can also be considered such as ÖTV raise for cars, amount of loan given by banks for buying vehicles, etc.